Abstract - We consider a novel group testing procedure, termed 
semi-quantitative group testing, motivated by a class of problems 
arising in genome sequence processing. Semi-quantitative group 
testing (SQGT) is a non-binary pooling scheme that may be 
viewed as a combination of an adder model followed by a 
quantizer. For the new testing scheme we define the capacity 
and evaluate it for some special choices of parameters 
using information theoretic methods. We also define a new class 
of disjunct codes suitable for SQGT, termed SQ-disjunct codes. 
Finally, we provide both explicit and probabilistic code construction 
methods for SQGT with simple decoding algorithms. 

I. Introduction 

Group testing (GT) is a pooling scheme for identifying a 
number of subjects with some particular characteristic - called 
"positives" - among a large pool of subjects. The idea behind 
GT is that if the number of positives is much smaller than the 
number of subjects, one can reduce the number of experiments 
by testing adequately chosen groups of subjects rather than 
testing each subject individually. In its full generality, GT may 
be viewed as the problem of inferring the state of a system 
from a superposition of the state vectors of a subset of the 
system's elements. As such, it has found many applications in 
communication theory, signal processing, computer science, 
mathematics, biology, etc. (see, for example, [1]-[3]). 

Although many models have been considered for combi- 
natorial GT, two main models include the original model 
considered by Dorfman [4] (henceforth, conventional GT) 
and the adder model (also known as the adder channel or 
quantitative GT) [1]. In the former case, the result of a test is 
an indicator of whether there exists at least one positive in 
the test (equal to 0 if there is no positive in the test, and 1 otherwise), 
while in the latter case, the result of a test specifies the exact 
number of positives in that test. Motivated by applications in 
genome sequence processing, we propose a novel test model 
termed semi-quantitative group testing (SQGT). This model 
accounts for the fact that in most applications a test is not 
precise enough to exactly determine the number of positives, 
but it is more informative than a simple indicator of the 
presence of at least one positive. In other words, schemes in 
which results are obtained using a test device with limited 
precision may be modeled as instances of SQGT. 

We also allow for the possibility of having different amounts 
of sample material for different test subjects, which results in 
non-binary test matrices. Although binary testing is required 
for some applications - such as coin weighing - in other 
applications, such as conflict resolution in multiple access 
channel (MAC) and genotyping, non-binary tests may be used 
to further reduce the number of tests. In the former example, 
different non-binary values in a test correspond to different 
power levels of the users, while in the latter example, they 
correspond to different amounts of genetic material of dif- 
ferent subjects. The reason that non-binary tests are extremely 
important is that in applications like genotyping, tests are very 
expensive so that one may be inclined to reduce the number 
of tests at the expense of extracting more genetic material. To 
the best of the authors' knowledge, the use of non-binary adder 
schemes for MAC channels is limited to a handful of papers, 
including [5] and [6]. 

For the new and versatile model of SQGT with Q-ary test 
results and q-ary test sample sizes, $Q, q \ge 2$, we define the 
concept of capacity. Furthermore, we define a new generalization 
of the family of disjunct codes, first introduced in [7], 
called "SQ-disjunct" codes. Similar to the family of disjunct 
codes, this new code family may be decoded using a simple, 
low-complexity algorithm. We conclude our exposition with 
a novel probabilistic method for code construction, of use in 
applications where the physics of the experiments prohibits 
structured codes. 

The paper is organized as follows. Section II describes 
the SQGT model, while Section III introduces the capacity 
of SQGT. In Section IV we define SQ-disjunct codes and 
present some simple properties of these codes. In Section V 
we describe a number of constructions for SQGT codes. 

II. Semi-quantitative Group Testing 

Throughout this paper, we adopt the following notation. 
Bold-face upper-case and bold-face lower-case letters denote 
matrices and vectors, respectively. Calligraphic letters are used 
to denote sets. Asymptotic symbols such as $\sim$ and $o(\cdot)$ are 
used in a standard manner. For an integer $k$, we define 
$[k] := \{0, 1, \dots, k-1\}$. 

Let N denote the number of test subjects, and let m denote 
the number of positives. Also, let u denote an upper bound 
on the number of positives (i.e. $m \le u$). Let $S_i$ denote the 
$i$-th subject, $i \in \{1, 2, \dots, N\}$, and let $S_{i_j} = D_j$ be the $j$-th 
positive, $j \in \{1, 2, \dots, m\}$. Furthermore, let $\mathcal{D}$ denote the set 
of positives, so that $|\mathcal{D}| = m$. We assign to each subject a 
unique q-ary vector of length n, termed the "signature" or 
the codeword of the subject. Each coordinate of the signature 
corresponds to a test. If $\mathbf{x}_i \in [q]^n$ denotes the signature of 
the $i$-th subject, then the $k$-th coordinate of $\mathbf{x}_i$ may be viewed 
as the "amount" of $S_i$ (i.e. sample size, concentration, etc.) 
used in the $k$-th test. Note that the symbol 0 indicates that $S_i$ is 
not in the test. For convenience, we refer to the collection of 
codewords arranged column-wise as the test matrix or code. 



The result of each test is an integer from the set [Q]. 
Each test outcome depends on the number of positives and 
their sample amount in the test through Q thresholds, $\eta_l$ 
($l \in \{1, 2, \dots, Q\}$). More precisely, the outcome of the $k$-th 
test, $y_k$, equals 

$$y_k = r \quad \text{if} \quad \eta_r \le \sum_{j=1}^{m} x_{k,i_j} < \eta_{r+1}, \qquad (1)$$

where $x_{k,i_j}$ is the $k$-th coordinate of $\mathbf{x}_{i_j}$, and $\eta_0 = 0$. Based 
on the definition, it is clear that SQGT may be viewed as a 
concatenation of an adder channel and a decimator (quantizer). 
Also, if $q = Q = 2$ and $\eta_1 = 1$, the SQGT model reduces 
to conventional GT. Furthermore, if $Q - 1 = m(q - 1)$ and 
$\forall r \in [Q]$, $\eta_r = r$, then the SQGT reduces to the adder channel, 
with a possibly non-binary test matrix. Note that in this model, 
we assume that $\eta_Q > (q - 1)u$. 
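
As an illustration of (1), the following minimal Python sketch maps the sample amounts of the positives in a single test to the test outcome. The function name `sqgt_outcome` and the list-based representation of the thresholds are assumptions made for this example and do not appear in the original text.

```python
from bisect import bisect_right


def sqgt_outcome(amounts, thresholds):
    """Outcome of one SQGT test, following (1).

    amounts:    sample amounts of the positives taking part in this test.
    thresholds: [eta_1, ..., eta_Q], strictly increasing; eta_0 = 0 is implicit.
    Returns r in {0, ..., Q-1} such that eta_r <= sum(amounts) < eta_{r+1}.
    """
    total = sum(amounts)
    if total >= thresholds[-1]:
        # The model assumes eta_Q > (q - 1) u, so this should never happen.
        raise ValueError("sum of sample amounts reaches eta_Q")
    # bisect_right counts how many of eta_1..eta_Q are <= total, which is
    # exactly the index r of the quantization region containing the sum.
    return bisect_right(thresholds, total)


# Conventional GT: q = Q = 2, eta_1 = 1 (eta_2 only caps the sum).
gt_result = sqgt_outcome([1, 0, 1], [1, 4])
# Adder channel: eta_r = r, so the outcome is the sum itself.
adder_result = sqgt_outcome([1, 1], [1, 2, 3, 4])
```

With these inputs the conventional GT test returns 1 (at least one positive present) and the adder test returns 2 (the exact sum), matching the two special cases mentioned above.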

Of special interest is SQGT with a uniform quantizer - 
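i.e. SQGT with equidistant thresholds. In this case, $\eta_r = r\eta$, 
where $r \in [Q + 1]$, and the following definition may be used 
to simplify (1). 

Definition 1: The "SQ-sum" of $s \ge 1$ codewords $\mathbf{x}_j$, 
$1 \le j \le s$, denoted by $\mathbf{y} = \bigoplus_{j=1}^{s} \mathbf{x}_j = \mathbf{x}_1 \oplus \mathbf{x}_2 \oplus \cdots \oplus \mathbf{x}_s$, 
is a vector of length n with its $i$-th coordinate equal to 

$$y_i = \left\lfloor \frac{x_{i,1} + x_{i,2} + \cdots + x_{i,s}}{\eta} \right\rfloor, \qquad (2)$$

where $x_{i,j}$ is the $i$-th coordinate of $\mathbf{x}_j$. Here, "+" stands for real-valued addition. 
Using this definition, (1) reduces to 

$$\mathbf{y} = \bigoplus_{j=1}^{m} \mathbf{x}_{i_j}, \qquad (3)$$

where $\mathbf{x}_{i_j}$ is the signature of the $j$-th positive. 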
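
For equidistant thresholds, the SQ-sum amounts to a coordinate-wise integer division of the real-valued sum by $\eta$. The sketch below illustrates this; the function name `sq_sum` and the example codewords are hypothetical and chosen only for illustration (here q = 3 and $\eta = 2$).

```python
def sq_sum(codewords, eta):
    """SQ-sum of equal-length codewords under uniform thresholds r * eta."""
    if not codewords:
        raise ValueError("the SQ-sum is defined for s >= 1 codewords")
    # Add the sample amounts test by test, then apply the uniform quantizer.
    return [sum(column) // eta for column in zip(*codewords)]


x1 = [2, 0, 1, 2]
x2 = [1, 2, 0, 2]
x3 = [0, 2, 1, 2]
# The coordinate-wise sums are 3, 4, 2, 6.
y = sq_sum([x1, x2, x3], eta=2)   # [1, 2, 1, 3]
```

Note that with $\eta = 1$ the SQ-sum of a single codeword is the codeword itself, which corresponds to the adder channel.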

III. Capacity of SQGT 

It is well-known that group testing may be viewed as a 
special instance of a multiple access channel (MAC) (e.g. 
see [9]). Using this connection, Malyutov [8], D'yachkov [9], 
and Atia et al. [10] derived information theoretic necessary 
and sufficient conditions on the required number of tests for 
conventional GT. It is tedious, yet straightforward, to show 
that the model described in [8], [9], [10] may also be used to 
evaluate SQGT schemes. The main difference in the analysis 
of GT and SQGT arises due to the different forms of the 
mutual information used to express the necessary and sufficient 
conditions. We therefore focus on characterizing the mutual 
information quantities arising in the SQGT framework. Our notation 
follows the setup of [10]. 

Let the sample amount of each subject in each test be 
chosen in an i.i.d. manner from a q-ary alphabet, according 
to a distribution $P_T$. Also, let $\mathcal{D}_1$ and $\mathcal{D}_2$ be disjoint 
sets that partition the set of positives, $\mathcal{D}$, such that $|\mathcal{D}_1| = i$ and 
$|\mathcal{D}_2| = m - i$; we denote by $\mathcal{A}_{\mathcal{D}}^{(i)}$ the set of all possible pairs 
$(\mathcal{D}_1, \mathcal{D}_2)$. For a single test, we define $y$ as the test result, 
and $\mathbf{t}_{\mathcal{D}_j}$ (where $j = 1, 2$) as a vector of size $1 \times |\mathcal{D}_j|$, with 
its $k$-th entry equal to the sample amount of the $k$-th positive of 
$\mathcal{D}_j$ in the test. Fig. 1 shows a choice of $(\mathcal{D}_1, \mathcal{D}_2)$ and 
their corresponding vectors $\mathbf{t}_{\mathcal{D}_1}$ and $\mathbf{t}_{\mathcal{D}_2}$ for the case where 
m = 5 and q = 2. 

Fig. 1. One choice of $(\mathcal{D}_1, \mathcal{D}_2)$ and their corresponding $\mathbf{t}_{\mathcal{D}_1}$ and $\mathbf{t}_{\mathcal{D}_2}$ in a binary test design for m = 5. 

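By following the same steps as in [10], it can be shown that 
for any fixed m, if the number of tests n satisfies 

$$n > \max_{i:\,(\mathcal{D}_1, \mathcal{D}_2) \in \mathcal{A}_{\mathcal{D}}^{(i)}} \frac{\log \binom{N-m}{i}\binom{m}{i}}{I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y)}, \quad i = 1, 2, \dots, m, \qquad (4)$$

then the average probability of error asymptotically approaches 
zero. In this equation, $I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y)$ stands for the mutual 
information between $\mathbf{t}_{\mathcal{D}_1}$ and $(\mathbf{t}_{\mathcal{D}_2}, y)$. Note that since the 
sample amounts of the subjects are chosen independently 
and identically, the value of $I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y)$ does not depend 
on the specific choice of $(\mathcal{D}_1, \mathcal{D}_2) \in \mathcal{A}_{\mathcal{D}}^{(i)}$. Similarly, a necessary 
condition for zero average error probability for SQGT is 

$$n > \max_{i:\,(\mathcal{D}_1, \mathcal{D}_2) \in \mathcal{A}_{\mathcal{D}}^{(i)}} \frac{\log \binom{N-m+i}{i}}{I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y)}, \quad i = 1, 2, \dots, m. \qquad (5)$$

Definition 2: [Asymptotic capacity of SQGT channel] Using 
(4) and (5), we define the asymptotic capacity of the channel 
corresponding to the SQGT scheme (henceforth, SQGT 
channel) as 

$$C = \sup_{P_T, \boldsymbol{\eta}} a(m, P_T, \boldsymbol{\eta}),$$

where 

$$a(m, P_T, \boldsymbol{\eta}) = \min_{i \in \{1, 2, \dots, m\}} \frac{I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y)}{i}, \qquad (6)$$

and where $\boldsymbol{\eta}$ is a vector of length Q with $\eta_k$ its $k$-th entry. 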

In certain applications, $\boldsymbol{\eta}$ may be determined a priori by the 
resolution of the test equipment. In such applications, the only 
design parameter to optimize is $P_T$. On the other hand, if one 
is able to control the thresholds, $\boldsymbol{\eta}$ becomes a design parameter 
and clearly exhibits a strong influence on the capacity of the 
test scheme. 



Define the rate of a group test as $R = \frac{\log N}{n}$. The next 
theorem clarifies the use of the term "capacity" in Definition 2. 

Theorem 1: For a SQGT channel, all rates below capacity 
C are achievable. In other words, for every rate R < C, 
there exists a test design for which the average probability of 
error converges to zero. Conversely, any test design achieving 
zero average probability of error must asymptotically 
satisfy R < C. 

Proof: The proof is omitted due to space limitations. ■ 

The mutual information $I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y)$ in (6) may be evaluated 
as follows. Let $W_j$, $j = 1, 2$, denote the $\ell_1$-norm of $\mathbf{t}_{\mathcal{D}_j}$. 
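Then, 

$$I(\mathbf{t}_{\mathcal{D}_1}; \mathbf{t}_{\mathcal{D}_2}, y) = H(y \mid \mathbf{t}_{\mathcal{D}_2}) - H(y \mid \mathbf{t}_{\mathcal{D}_1}, \mathbf{t}_{\mathcal{D}_2}) = H(y \mid \mathbf{t}_{\mathcal{D}_2})$$

$$= -\sum_{l=0}^{Q-1} \sum_{k=0}^{(m-i)(q-1)} P(W_2 = k)\, P(y = l \mid W_2 = k) \log P(y = l \mid W_2 = k).$$

On the other hand, $\forall l \in [Q]$, 

$$P(y = l \mid W_2 = k) = P(\eta_l - k \le W_1 < \eta_{l+1} - k) = \begin{cases} \sum_{w_1 = \max(0,\, \eta_l - k)}^{\eta_{l+1} - k - 1} P_{W_1}(w_1), & \text{if } k < \eta_{l+1}, \\ 0, & \text{otherwise}, \end{cases} \qquad (7)$$

where $P_{W_1}(w_1)$ is the probability mass function (PMF) of $W_1$ 
and can be found using 

$$P_{W_1}(w_1) = P_T(t_1) * P_T(t_2) * \cdots * P_T(t_i), \qquad (8)$$

where "*" denotes convolution. Similarly, $P(W_2 = k)$ can be 
found using 

$$P_{W_2}(w_2) = P_T(t_1) * P_T(t_2) * \cdots * P_T(t_{m-i}). \qquad (9)$$

Note that when q = 2, 

$$P(y = l \mid W_2 = k) = \begin{cases} \sum_{j = \max(0,\, \eta_l - k)}^{\min(i,\, \eta_{l+1} - k - 1)} \binom{i}{j} p^j (1 - p)^{i - j}, & \text{if } k < \eta_{l+1}, \\ 0, & \text{otherwise}, \end{cases} \qquad (10)$$

and 

$$P(W_2 = k) = \binom{m - i}{k} p^k (1 - p)^{m - i - k}, \qquad (11)$$

where p is the probability that a subject is present in a test. 

Fig. 2. Numerically obtained lower bound for SQGT with q = 3 for different values of m. 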
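
The evaluation above can be sketched numerically as follows. This is a minimal illustration rather than the search procedure used by the authors: the function names are hypothetical, logarithms are taken to base 2 (the text does not fix the base), and the thresholds are passed as the list $[\eta_1, \dots, \eta_Q]$ with $\eta_Q$ larger than the largest possible sum.

```python
import math


def convolve(a, b):
    """Discrete convolution of two PMFs given as lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def pmf_of_sum(p_t, count):
    """PMF of the sum of `count` i.i.d. sample amounts with PMF p_t."""
    pmf = [1.0]                      # an empty sum equals 0 with probability 1
    for _ in range(count):
        pmf = convolve(pmf, p_t)
    return pmf


def region(total, thresholds):
    """Index r with eta_r <= total < eta_{r+1}, where eta_0 = 0."""
    return sum(1 for eta in thresholds if eta <= total)


def mutual_information(p_t, thresholds, m, i):
    """H(y | W2) in bits for |D1| = i and |D2| = m - i."""
    pmf_w1 = pmf_of_sum(p_t, i)
    pmf_w2 = pmf_of_sum(p_t, m - i)
    info = 0.0
    for k, p_k in enumerate(pmf_w2):
        if p_k == 0.0:
            continue
        # P(y = l | W2 = k): push W1 + k through the quantizer.
        p_y = [0.0] * len(thresholds)
        for w1, p_w1 in enumerate(pmf_w1):
            r = region(w1 + k, thresholds)
            if r >= len(thresholds):
                raise ValueError("eta_Q must exceed the largest possible sum")
            p_y[r] += p_w1
        info -= p_k * sum(p * math.log2(p) for p in p_y if p > 0.0)
    return info


def a_value(p_t, thresholds, m):
    """Minimum over i of the mutual information divided by i."""
    return min(mutual_information(p_t, thresholds, m, i) / i
               for i in range(1, m + 1))
```

For instance, for the binary adder channel with m = 2 and p = 1/2 (thresholds 1, 2, 3), the call `a_value([0.5, 0.5], [1, 2, 3], 2)` evaluates the two terms 1 and 1.5/2 and returns their minimum. Maximizing `a_value` over $P_T$ and the thresholds, for example by a grid search, yields a lower bound on the capacity.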

In order to obtain further insight into the behavior of SQGT 
channels, we evaluated (6) numerically using a simple search 
procedure that allows us to quickly determine a lower bound 
on the capacity. Fig. 2 shows the obtained lower bound on the 
capacity when q = 3, and Q = 2 or Q = 3. Table I shows 
one set of probability distributions and thresholds achieving 
this bound for Q = 3. 

Table I reveals an interesting property of the quantizers 
found through numerical search: there exists at least one 
quantization region that consists of one or two elements 
only. What this finding implies is that in order to reduce the 
number of tests as much as possible, a sufficient amount of 
qualitative information has to be preserved. For example, 
a quantizer that assigns the value $v$ only to inputs of 
value $v$ allows for resolving a large amount of uncertainty. 



TABLE I 
A SET OF PROBABILITY DISTRIBUTIONS AND THRESHOLDS 
CORRESPONDING TO Q = 3 IN FIG. 2. 

 m    P_T                 quantizer
 2    [0.33 0.34 0.33]    {0,1}{2}{3,4}
 3    [0.43 0.46 0.11]    {0,1}{2}{3,4,5,6}
 4    [0.18 0.64 0.18]    {0,1,2,3}{4}{5,6,7,8}
 5    [0.15 0.70 0.15]    {0,1}{2,3,4,5}{6,7,8,9,10}
 6    [0.60 0.07 0.33]    {0,1,2,3}{4,5}{6,7,...,12}
 7    [0.34 0.25 0.41]    {0,1,...,6}{7,8}{9,10,...,14}
 8    [0.10 0.80 0.10]    {0,1,...,7}{8}{9,10,...,16}
 9    [0.82 0.09 0.09]    {0,1,...,8}{9}{10,11,...,18}
 10   [0.58 0.28 0.14]    {0,1,...,4}{5,6}{7,8,...,20}


Furthermore, the most informative input, left unaltered after 
quantization, corresponds to a statistical average of the input 
symbols, reminiscent of the centroid of a quantization region. 
These findings will be discussed in more detail in the full 
version of the paper. 

IV. Generalized Disjunct and Separable Codes for SQGT 

Disjunct codes were first introduced in [7] for efficient zero- 
error group testing reconstruction. In what follows, we define 
a new family of disjunct codes suitable for SQGT that shares 
many of the properties of binary disjunct codes. 

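Definition 3: The syndrome of a set of vectors $\{\mathbf{x}_j\}$, $j \in 
\{1, 2, \dots, s\}$, such that $\mathbf{x}_j \in [q]^n$, is a vector $\mathbf{y} \in [Q]^n$ equal 
to $\mathbf{y} = \bigoplus_{j=1}^{s} \mathbf{x}_j$. 

Definition 4: A set of codewords $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_s\}$ with 
syndrome $\mathbf{y}_x$ is said to be included in another set of codewords 
$\mathcal{Z} = \{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_t\}$ with syndrome $\mathbf{y}_z$, if $\forall i \in \{1, 2, \dots, n\}$, 
$y_{x_i} \le y_{z_i}$. We denote this inclusion property by $\mathcal{X} \lhd \mathcal{Z}$ or 
equivalently $\mathbf{y}_x \lhd \mathbf{y}_z$. 

Remark 1: By this definition, it can be easily verified that if 
$\mathcal{X} \subseteq \mathcal{Z}$, then $\mathcal{X} \lhd \mathcal{Z}$. 

Note that for q = 2, this definition is equivalent to the 
definition of inclusion for conventional GT, defined in [7]. 

Definition 5: A code is called a $[q; Q; \boldsymbol{\eta}; u]$-SQ-disjunct code 
of length n and size N if $\forall s, t \le u$ and for any sets of q-ary 
codewords $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_s\}$ and $\mathcal{Z} = \{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_t\}$, 
$\mathcal{X} \lhd \mathcal{Z}$ implies $\mathcal{X} \subseteq \mathcal{Z}$. 

Henceforth, we focus on the case where the thresholds are 
equidistant. We call such codes $[q; Q; \eta; u]$-SQ-disjunct codes. 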

Proposition 1: A code is $[q; Q; \eta; u]$-SQ-disjunct if and only 
if no codeword is included in the set of u other codewords. 

Proof: It is easy to verify that if a code is $[q; Q; \eta; u]$-SQ-disjunct, 
then no codeword is included in the set of u other 
codewords. 

Conversely, let $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_s\}$ and $\mathcal{Z} = 
\{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_t\}$ be two sets of codewords where $s, t \le u$. 
From the assumption that no codeword is included in the set 
of u other codewords, one can conclude that no codeword is 
included in the set of t other codewords when $t \le u$. If $\mathcal{X} \lhd \mathcal{Z}$ 
but $\mathcal{X} \not\subseteq \mathcal{Z}$, then there exists a codeword $\mathbf{x}_j$ ($j \in \{1, 2, \dots, s\}$) 
such that $\mathbf{x}_j \notin \mathcal{Z}$. But since $\{\mathbf{x}_j\} \lhd \mathcal{X} \lhd \mathcal{Z}$, then $\{\mathbf{x}_j\} \lhd \mathcal{Z}$, 
which contradicts the assumption that no codeword is included 
in t other codewords. ■ 
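
Proposition 1 suggests a direct, if exponential-time, way of checking whether a small code is SQ-disjunct. The following sketch is an illustration under the assumption that the code is stored as a list of columns (codewords); the function names are hypothetical. Since adding codewords can only increase a syndrome, it suffices to test sets of exactly u other codewords (or all others, if fewer than u exist).

```python
from itertools import combinations


def is_included(x, others, eta):
    """True if the syndrome of {x} is coordinate-wise <= that of `others`."""
    totals = [sum(col) for col in zip(*others)] if others else [0] * len(x)
    return all(xi // eta <= t // eta for xi, t in zip(x, totals))


def is_sq_disjunct(columns, eta, u):
    """Brute-force check of the condition in Proposition 1."""
    for idx, x in enumerate(columns):
        rest = columns[:idx] + columns[idx + 1:]
        for others in combinations(rest, min(u, len(rest))):
            if is_included(x, list(others), eta):
                return False
    return True


# Identity code scaled by 2 (q = 3): SQ-disjunct for eta = 2.
scaled = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
# Binary identity code: every codeword has an all-zero syndrome for eta = 2.
binary = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
results = (is_sq_disjunct(scaled, eta=2, u=2),
           is_sq_disjunct(binary, eta=2, u=2))   # (True, False)
```

The second example fails because all entries are smaller than $\eta$, so each single-codeword syndrome is the all-zero vector and is trivially included in any other syndrome.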

Remark 2: From Proposition 1 one can conclude that a 
code is $[q; Q; \eta; u]$-SQ-disjunct if and only if for any set of 
$u + 1$ codewords, $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{u+1}\}$, there exists a unique 
coordinate $k(i)$ in each codeword $\mathbf{x}_i$, for which 

$$\left\lfloor \frac{x_{k(i),i}}{\eta} \right\rfloor > \left\lfloor \frac{\sum_{j=1,\, j \ne i}^{u+1} x_{k(i),j}}{\eta} \right\rfloor. \qquad (12)$$

By unique coordinate, we mean that $k(i) \ne k(j)$ if $i \ne j$. 
Consequently, a necessary condition for the existence of a 
$[q; Q; \eta; u]$-SQ-disjunct code is that $q - 1 \ge \eta$. As a result, 
there exists no binary $[2; Q; \eta; u]$-SQ-disjunct code when 
$\eta > 1$. 

Proposition 2: Any code generated by multiplying a conventional 
binary u-disjunct code by $q - 1$, where $q - 1 > \eta$, is a 
$[q; Q; \eta; u]$-SQ-disjunct code. As a result, the rate of the best 
$[q; Q; \eta; u]$-SQ-disjunct code is at least as large as the rate of 
the best binary u-disjunct code with the same size and length. 

Our interest in SQ-disjunct codes lies in their simple decoding 
procedures, of complexity $O(nN)$. However, one can 
construct codes for SQGT using other GT codes, such as 
binary u-separable codes for conventional GT [7] or codes 
designed for the adder channel [11]. It can be shown that 
either of these two families of codes can be multiplied by $\eta$ to form 
a code for SQGT. 
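
The $O(nN)$ decoder mentioned above can be sketched as a single pass over the columns of the test matrix: a subject is declared positive if the syndrome of its codeword is included in the observed syndrome. In the example below, the binary 2-disjunct code built from the 12 lines of the affine plane of order 3 is a choice made for this illustration only (any two lines share at most one point, so two lines cannot cover a third); it is not a construction taken from the text. The code is then scaled by $q - 1 = 3$ with $\eta = 2$, in the spirit of Proposition 2. All function names are hypothetical.

```python
def affine_plane_code():
    """12 binary columns of length 9: the lines of the affine plane AG(2,3)."""
    points = [(a, b) for a in range(3) for b in range(3)]
    lines = [[(x, (s * x + c) % 3) for x in range(3)]
             for s in range(3) for c in range(3)]
    lines += [[(c, y) for y in range(3)] for c in range(3)]   # vertical lines
    return [[1 if p in line else 0 for p in points] for line in lines]


def sq_sum(codewords, eta):
    """SQ-sum under uniform thresholds r * eta."""
    return [sum(column) // eta for column in zip(*codewords)]


def decode_sq_disjunct(columns, y, eta):
    """Indices of all columns whose syndrome is included in y; O(nN) work."""
    return [idx for idx, x in enumerate(columns)
            if all(xk // eta <= yk for xk, yk in zip(x, y))]


q, eta = 4, 2
code = [[(q - 1) * bit for bit in col] for col in affine_plane_code()]
positives = [0, 7]
y = sq_sum([code[i] for i in positives], eta)
decoded = decode_sq_disjunct(code, y, eta)   # recovers [0, 7]
```

Each column is scanned once against the n test outcomes, which gives the stated complexity.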

Remark 3: The constructions described in this section reveal 
the following highly intuitive fact: the number of individuals 
that may be successfully examined with Q-ary SQGT 
may be as large as the number of individuals that may be 
tested under the adder channel model, provided that one is 
allowed to pool different amounts of sample material in each 
test. In other words, the rate of adder and SQGT channels 
may be the same, despite the loss of information induced by 
the quantizer, provided that the alphabet size of the latter 
scheme is sufficiently larger than the alphabet size of the 
former scheme. 

V. Code Construction for SQGT 

In what follows, we discuss two approaches for constructing 
SQGT codes. For simplicity, we focus on SQGT codes with 
equidistant thresholds. The first approach relies on classical 
combinatorial methods, while the second approach, usually not 
encountered in coding theory, relies on probabilistic methods. 
The second approach is of special interest for applications 
such as genotyping, where one cannot arbitrarily choose the 
test matrices. The tests are usually determined by the physics 
of the experiment, and only certain statistical properties of 
the tests are known. In this scenario, "structure" is to be seen 
as probabilistic trait. We show that one way to approach this 
problem is to characterize the number of tests that ensures that 
almost all members of a code family possess a given trait and 
act as SQGT codes. 



A. Combinatorial Construction 

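Fix a binary u-disjunct code matrix $\mathbf{C}_b$, of dimensions 
$n_b \times N_b$, with code-length $n_b$ and $N_b$ codewords. Let 
$K = \left\lfloor \log_u \left( \left\lfloor \frac{q-1}{\eta} \right\rfloor (u - 1) + 1 \right) \right\rfloor$; construct a code of length 
$n = n_b$ and size $N = K N_b$ by concatenating K matrices, 
$\mathbf{C} = [\mathbf{C}_1, \mathbf{C}_2, \dots, \mathbf{C}_K]$, where 

$$\mathbf{C}_j = \left( \sum_{i=0}^{j-1} u^i \eta \right) \mathbf{C}_b, \quad 1 \le j \le K.$$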

Theorem 2: Let the concatenated code C be as described 
above. The code is capable of uniquely identifying up to u 
positives. 

Proof: The proof is based on exhibiting a decoding proce- 
dure and showing that the procedure allows for distinguishing 
between any two different sets of positives. The decoder is 
described below. 

Let $\mathbf{y}$ be the Q-ary vector of test outcomes, or equivalently, 
the syndrome of the positives. For a rational vector $\mathbf{z}$, let $\lfloor \mathbf{z} \rfloor$ 
and $\langle \mathbf{z} \rangle$ denote the vector of integer parts of $\mathbf{z}$ and fractional 
parts of $\mathbf{z}$, respectively. If u = 1, decoding reduces to finding 
the column of $\mathbf{C}$ equal to $\eta \mathbf{y}$. If u > 1, decoding proceeds as 
follows. 

Step 1: Set $\mathbf{y}'_K = \mathbf{y}$ and form vectors $\mathbf{y}_j$ and $\mathbf{y}'_{j-1}$, $1 \le j \le K$, 
using the rules: 

$$\mathbf{y}_j = \frac{u^j - 1}{u - 1} \left\lfloor \frac{u - 1}{u^j - 1}\, \mathbf{y}'_j \right\rfloor \quad \text{and} \quad \mathbf{y}'_{j-1} = \frac{u^j - 1}{u - 1} \left\langle \frac{u - 1}{u^j - 1}\, \mathbf{y}'_j \right\rangle. \qquad (13)$$

Step 2: Identify the positives as follows: if the syndrome 
of a column of $\mathbf{C}_j$ is included in $\mathbf{y}_j$, declare the subject 
corresponding to that column positive. Declare the subject 
negative otherwise. 

The result is obviously true for u = 1. Therefore, we focus 
on the case u > 1. First, using induction, one can prove that 
each $\mathbf{y}_j$, $1 \le j \le K$, is the syndrome of a subset of columns of 
$\mathbf{C}_j$ corresponding to positives. Let $\mathbf{C}'_j = [\mathbf{C}_1, \mathbf{C}_2, \dots, \mathbf{C}_j]$, 
where $1 \le j \le K$. Since the non-zero entries of $\mathbf{C}$ are 
multiples of $\eta$, $\eta \mathbf{y}$ is the sum of columns of $\mathbf{C}$ corresponding 
to a subset of positives. Also, the maximum value of the entries 
of $\mathbf{C}'_{K-1}$ equals $\eta \frac{u^{K-1} - 1}{u - 1}$. Since there are at most u positives, 
the maximum value of their sum does not exceed $\eta \frac{u^K - u}{u - 1}$. 
This bound is strictly smaller than $\eta \frac{u^K - 1}{u - 1}$, the minimum non-zero 
entry of $\mathbf{C}_K$. As a result, $\mathbf{y}_K$ is the syndrome of the 
positives with signatures in $\mathbf{C}_K$, and $\mathbf{y}'_{K-1}$ is the syndrome of 
positives with signatures in $\mathbf{C}'_{K-1}$. Similarly, it can be shown 
that $\forall j$, $1 \le j \le K - 1$, $\mathbf{y}_j$ is the syndrome of the positives 
with signatures in $\mathbf{C}_j$, and $\mathbf{y}'_{j-1}$ is the syndrome of the positives 
with signatures in $\mathbf{C}'_{j-1}$. 

From Proposition 2 we know that each $\mathbf{C}_j$ is a $[q; Q; \eta; u]$-SQ-disjunct 
code. Consequently, using Step 2, one can uniquely 
identify the positives with signatures from $\mathbf{C}_j$. ■ 
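
The concatenated construction and its two-step decoder can be sketched as follows. This is an illustration under the reading that block j equals $\eta (1 + u + \cdots + u^{j-1}) \mathbf{C}_b$ and that Step 1 peels off the blocks from j = K down to j = 1 using integer division and remainder; the function names and the toy choice of a 4 x 4 identity matrix as the binary u-disjunct code are assumptions made for this example.

```python
def concatenated_code(c_b, eta, u, K):
    """Blocks C_1..C_K, each a list of columns of the scaled binary code."""
    blocks = []
    for j in range(1, K + 1):
        a_j = sum(u ** i for i in range(j))   # equals (u**j - 1) // (u - 1)
        blocks.append([[eta * a_j * bit for bit in col] for col in c_b])
    return blocks


def decode_concatenated(blocks, y, eta, u):
    """Steps 1 and 2 of the decoder; returns sorted (block, column) pairs."""
    residual = list(y)                        # y'_K = y
    positives = []
    for j in range(len(blocks), 0, -1):
        a_j = sum(u ** i for i in range(j))
        # y_j = a_j * floor(y'_j / a_j): syndrome of the positives in C_j.
        y_j = [a_j * (r // a_j) for r in residual]
        # y'_{j-1} = a_j * frac(y'_j / a_j), i.e. the remainder modulo a_j.
        residual = [r % a_j for r in residual]
        for idx, col in enumerate(blocks[j - 1]):
            if all(x // eta <= yk for x, yk in zip(col, y_j)):
                positives.append((j, idx))
    return sorted(positives)


# Toy parameters: u = 2, eta = 2, q = 7, so K = floor(log2(3 * 1 + 1)) = 2.
identity = [[1 if r == c else 0 for r in range(4)] for c in range(4)]
blocks = concatenated_code(identity, eta=2, u=2, K=2)
chosen = [(1, 1), (2, 0)]
total = [sum(col) for col in zip(*[blocks[j - 1][i] for j, i in chosen])]
y = [t // 2 for t in total]                   # SQ-sum with eta = 2
decoded = decode_concatenated(blocks, y, eta=2, u=2)   # [(1, 1), (2, 0)]
```

The remainder computation relies on the sum contributed by blocks 1 to j-1 being strictly smaller than the smallest non-zero entry of block j, which is the inequality used in the proof.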

Remark 4: The method described above can be used with any 
binary separable code for conventional GT or adder channel 
to generate a SQGT code. 



Remark 5: All the constructions described in this paper are 
able to identify up to u positives in a pool of N subjects 
when u « N . However, when u iV, one can construct 
non-binary codes with length n and asymptotic size of ~ 
([log[^^JJ + logn/2)n (see the full version of this paper). 

B. Probabilistic Construction 

We consider the following problem: find a critical rate such 
that any randomly generated q-ary code with rate less than the 
critical rate is a $[q; Q; \eta; u]$-SQ-disjunct code with probability 
close to one. Based on the critical rate, which depends on 
the statistical properties of the process used to generate the 
codes, one can identify the smallest number of tests required 
to ensure that any code in the family may be used for SQGT. 

Theorem 3: Let identical = ^ - Ve > 0, where 



I <L± 



7 

and $l = \left\lfloor \frac{q-1}{\eta} \right\rfloor$. Any q-ary code of length n and size N with 
rate asymptotically satisfying $R \le R_{\text{critical}}$ is a $[q; Q; \eta; u]$-SQ-disjunct 
code with probability at least $1 - \epsilon$. 

Proof: Let $\mathcal{C}$ be a code of length n and size N, and let $\mathcal{M}_i$ 
be a set of $u+1$ codewords of $\mathcal{C}$. There are $L = \binom{N}{u+1}$ different 
ways to choose $\mathcal{M}_i$. For the $i$-th choice of $\mathcal{M}_i$, we define $E_i$ as 
the event that the syndrome of at least one of the codewords in 
$\mathcal{M}_i$ is included in the syndrome of the other u codewords. By 
this definition, each $E_i$ is mutually independent of all the other 
events, except for $d = \binom{N}{u+1} - \binom{N-u-1}{u+1} - 1$ of them. 
Suppose that $P(E_i) \le p'$ for all i. Using the high probability 
variation of the Lovász local lemma [12], if 

$$p' \le \frac{\epsilon}{L} \left( 1 - \frac{\epsilon}{L} \right)^{d}, \qquad (15)$$

then $P\left( \bigcap_{i=1}^{L} \bar{E}_i \right) > 1 - \epsilon$, where $\bar{E}_i$ is the complement of 
the event $E_i$. In other words, $\mathcal{C}$ is a $[q; Q; \eta; u]$-SQ-disjunct 
code with probability at least $1 - \epsilon$. 
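
To give a feel for the quantities entering (15), the following sketch computes L, the dependency count d, and the resulting threshold on $p'$ for small parameters. It assumes the reading of (15) as $p' \le (\epsilon/L)(1 - \epsilon/L)^d$ and of d as the number of other $(u+1)$-subsets that share at least one codeword with a given subset; the function names are hypothetical.

```python
from math import comb


def lll_parameters(N, u):
    """Number of (u+1)-subsets L and number d of subsets overlapping a given one."""
    L = comb(N, u + 1)
    # Subsets disjoint from a fixed one are chosen among the other N - u - 1
    # codewords; subtract them and the fixed subset itself.
    d = L - comb(N - u - 1, u + 1) - 1
    return L, d


def lll_threshold(N, u, eps):
    """Right-hand side of (15): the largest admissible bound p' on P(E_i)."""
    L, d = lll_parameters(N, u)
    return (eps / L) * (1 - eps / L) ** d


params = lll_parameters(5, 1)     # (10, 6): 10 pairs, 6 of which overlap a given pair
bound = lll_threshold(5, 1, 0.1)
```

For N = 5 and u = 1, a given pair of codewords overlaps with 6 of the remaining 9 pairs, and the admissible $p'$ is slightly below $\epsilon / L = 0.01$.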

From the definition of Ei, one has PiEi) = °. ^ , 

where a is the number of q-ary matrices of size ?i x (u + 
1) that do not satisfy ^V2\ and have distinct columns; also, 
(u + is the total number of ?i x (u + 1) matrices with 

distinct columns. In order to find an upper bound on a, we use 
the fact that a matrix that satisfies ( [T2] l has distinct columns. 
Consequently, 

(u + l)!^'^^^^ -a = g"("+i)-5 (16) 

where is the number of q-ary matrices with (possibly) 

repeated columns, and h is the number of such matrices that 
satisfy ^V2\ . It can be easily seen that, 

b^{u+l)c (17) 

where c is the number of g-ary matrices that do not contain 

a row, X, satisfying [^J > \_-^^ — ^J- On the other hand, 
c = A" where A is the number of "acceptable" g-ary rows of 
length Let x e [q]""''^ denote an acceptable row. If Ai, 

« 6 {1, 2, • • • , [^^J}, denotes the number of acceptable rows 
with the first entry xi from {i-q, ir]+ 1, • • • , («+ l)r/ — 1}, then 



A = Ai. Let / = [^-^J. If « < /, there are 7] choices for 



Xi; if « = /, we have {q — Irf) choices for xi. The number of 
ways to choose the rest of the entries (denoted by Bi) is 

'V^ fk + u-l\ fm + u-l\ 



fc=0 



where f*^^", ^] counts the number of non-negative integer 



solutions to 2 

/-I 



u+l 

J =2 



k. Consequently, 



iri + u — l\ , ^ , f In + u—l 
+ ((7-/77) 
u J \ u 



(19) 



■(/-l)77+u 



u+l 
Using these results. 



\u+lj \ u 



g"("+l)-(7.+ l)A" 



(20) 



\u+lJ 



Note that the second inequality does not loosen the bound 



significantly since 1 — 



As 77, N ~* 00, p 



I (u + l)A" 



as 77 



00. 



and L 



Also, fjd 

d _ », 



' = o(l) and therefore (l - f 

Consequently, ( [T5] l asymptotically simplifies to 

^ logAf ^ e{u + l) ^ o{l/N) ^ bg7 ^ log(eu!) 



itical 



77A^ ■ 77(7i+l) U+l 77(7^+1) 

log 7 _ log(l/e) 

77+1 77(77 + 1) 

where 7 = q^^+^^/A. This completes the proof. ■ 